Openmp #439

Closed
wants to merge 3 commits into from

Conversation

borisgin

Parallel version of Caffe for CPU based on OpenMP. Significant speed-up on CPU; scales well with the number of cores (3x for 4 cores, 10x for 16 cores). Modified files: convolutional, pooling, and ReLU layers, im2col and col2im.

@Yangqing
Member

Is there a benchmark showing how much speedup this brings compared to a multi-threaded BLAS implementation? For many layers (e.g. convolution) a multi-threaded BLAS is already fast, and explicit OpenMP parallelization makes the code more complex with little improvement.

@Yangqing Yangqing mentioned this pull request May 22, 2014
@luotao1

luotao1 commented May 23, 2014

I have already tried the OpenMP version of im2col_cpu and col2im_cpu in conv_layer.cpp. But in my test, this change decreased the performance of caffe_cpu_gemm in conv_layer.cpp, and the total elapsed time increased. Thus, I chose the pthread version of im2col_cpu and col2im_cpu; see pull #400. So far, I have not found out why the OpenMP version decreases the performance of gemm.

@borisgin
Author

I do the convolutional Forward() and Backward() in parallel for different images in the batch. As a result, the CPU code is much faster now: on a desktop I saw a 3.5x speed-up, and on a server with 16 cores ~10x. It is still not as fast as the GPU. The change I made in im2col and col2im was not related to OpenMP: I removed the division by modifying the for (c...) loop and added an early exit for boundary cases, as sketched below.
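For reference, here is a minimal sketch (not the PR's actual code) of the kind of restructuring described above, assuming the im2col_cpu signature Caffe used at the time: the single loop over channels_col, which needs a modulo and two divisions per iteration to recover the kernel offsets, is replaced by three nested loops, and rows that fall outside the image take an early exit to zero-fill.

// Sketch only: illustrates removing per-iteration division/modulo and adding
// an early exit for out-of-bounds rows; not the exact code from this PR.
template <typename Dtype>
void im2col_cpu_sketch(const Dtype* data_im, const int channels,
    const int height, const int width, const int ksize, const int pad,
    const int stride, Dtype* data_col) {
  const int height_col = (height + 2 * pad - ksize) / stride + 1;
  const int width_col = (width + 2 * pad - ksize) / stride + 1;
  for (int c_im = 0; c_im < channels; ++c_im) {
    for (int h_off = 0; h_off < ksize; ++h_off) {
      for (int w_off = 0; w_off < ksize; ++w_off) {
        // Same column-channel index as c_im*ksize*ksize + h_off*ksize + w_off.
        const int c = (c_im * ksize + h_off) * ksize + w_off;
        for (int h = 0; h < height_col; ++h) {
          const int h_pad = h * stride - pad + h_off;
          Dtype* out = data_col + (c * height_col + h) * width_col;
          if (h_pad < 0 || h_pad >= height) {
            // Early exit: this whole output row maps outside the image.
            for (int w = 0; w < width_col; ++w) out[w] = 0;
            continue;
          }
          const Dtype* in = data_im + (c_im * height + h_pad) * width;
          for (int w = 0; w < width_col; ++w) {
            const int w_pad = w * stride - pad + w_off;
            out[w] = (w_pad >= 0 && w_pad < width) ? in[w_pad] : 0;
          }
        }
      }
    }
  }
}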

@borisgin borisgin closed this May 23, 2014
@borisgin
Author

On benchmarks: I used cifar10 and imagenet training as benchmarks, comparing OpenMP against the current dev version (CPU). I used two machines for testing: a desktop (Ivy Bridge, 1 socket x 4 cores, no HT)
and a server (Sandy Bridge, 2 sockets x 8 cores, no HT).
cifar10 on the desktop: speed-up is 3.5x.
imagenet on the server: speed-up ~10x.

@luotao1

luotao1 commented May 23, 2014

I saw your code. You parallelize with #pragma omp parallel for //shared(bottom,top) over for (int n = 0; n < num_; ++n) in Forward_cpu, is this right? That loop has a data dependence: the i-th result depends on the (i-1)-th, so it can't be parallelized.

@borisgin
Author

Forward() can be done independently for each image n, but I had to replicate the im2col buffer per thread to avoid collisions between threads. For Backward() I replicated the weight_diff buffer per thread and summed them all after all threads finished their jobs.
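For illustration, a minimal sketch of the forward-pass pattern being described here, with a per-thread column buffer laid out as num_threads contiguous slices (as in the col_buffer_mt_ resize quoted later in the review); process_image is a placeholder for "im2col into this thread's buffer, then the GEMM", not the PR's ConvolutionLayer code.

#include <omp.h>
#include <cstddef>
#include <functional>
#include <vector>

// Each image in the batch is processed independently; every thread owns one
// slice of the shared scratch vector, so there are no write collisions.
void parallel_over_batch(int num_images, std::size_t col_buf_size,
    const std::function<void(int, float*)>& process_image) {
  const int num_threads = omp_get_max_threads();
  std::vector<float> col_buffer_mt(
      static_cast<std::size_t>(num_threads) * col_buf_size);

#pragma omp parallel for
  for (int n = 0; n < num_images; ++n) {
    float* col_buf = col_buffer_mt.data() +
        static_cast<std::size_t>(omp_get_thread_num()) * col_buf_size;
    process_image(n, col_buf);  // im2col into col_buf, then GEMM for image n
  }
}

The Backward() case follows the same pattern with a per-thread weight_diff slice, which is summed into the shared gradient after the parallel loop finishes.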

@borisgin
Author

I tested the OpenMP version with MKL and found some issues which I would like to investigate.

@borisgin
Author

Fixed Makefile to support OpenMP with MKL. Current benchmark results:
cifar10 (train_quick):
desktop (Core i7-3770K CPU @ 3.50GHz, 4 cores, no HT): 100 sec
K20: 90 sec
imagenet quick train (200 train iterations, with 100 test iterations every 100 iterations):
server (2 x Xeon E5-2680 @ 2.7 GHz, 8 cores each): 953 sec
K20: 640 sec

@borisgin borisgin reopened this May 26, 2014
@borisgin
Author

Fixed OpenMP + MKL. Tested on imagenet (200 train iterations). The difference between the CPU (2-socket Xeon E5-2680 server) and the GPU (K20) is 1.5x (the CPU is slower). For cifar10 the CPU is faster.

@borisgin
Author

To reproduce the results with MKL, you should:

  1. Get a free non-commercial version here: https://software.intel.com/en-us/non-commercial-software-development
  2. Set the environment (LD_LIBRARY_PATH):
export LD_LIBRARY_PATH=/opt/intel/composerxe/lib/intel64:/opt/intel/composerxe/mkl/lib/intel64:/opt/intel/mkl/lib/intel64:/usr/local/cuda-6.0/lib64:/usr/local/cuda-6.0/lib:/usr/local/lib:$LD_LIBRARY_PATH
export LIBRARY_PATH=/opt/intel/composerxe/lib/intel64:/opt/intel/composerxe/mkl/lib/intel64:$LIBRARY_PATH

@eladhoffer

Good job! Will take a look.

col_buffer_mt_.resize(num_of_threads_ *
channels_ * kernel_size_ * kernel_size_ * height_out * width_out);
weight_diff_mt_.resize(num_of_threads_ *
num_output_ * (channels_ / group_)* kernel_size_ * kernel_size_);
Member

As has also been pointed out in the discussion, I am worried about the intermediate data size: for example, for an input of size 55*55*256 and a 3*3 kernel (I am just making up numbers, they may not correspond to actual imagenet layers) with a stride-1 convolution, a single buffer will have size

55 * 55 * 3 * 3 * 256 * sizeof(float) = 28MB

With multiple threads (say 10) this grows to about 280MB, and with multiple convolutional layers it may quickly grow to gigabytes. That's why I feel that we should rely on a multithreaded BLAS to speed things up rather than having a single-threaded BLAS and explicit OpenMP code.
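For concreteness, the same arithmetic as a tiny standalone check (illustrative numbers only, matching the example above, not actual imagenet layer shapes):

#include <cstdio>

int main() {
  // One per-thread column buffer holds h_out * w_out * k * k * channels floats.
  const long long h_out = 55, w_out = 55, k = 3, channels = 256;
  const long long per_buffer = h_out * w_out * k * k * channels * sizeof(float);
  const long long threads = 10;
  std::printf("per buffer: %.1f MB, x%lld threads: %.1f MB per conv layer\n",
              per_buffer / 1e6, threads, threads * per_buffer / 1e6);
  return 0;
}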

Author

I tried to run Caffe with the large OverFeat net (http://cilvr.nyu.edu/doku.php?id=software:overfeat:start) on a K20, but the model did not fit in K20 DRAM. I was able to train it on the CPU. The footprint for 16 threads was ~16 GB(!). The training speed was ~5 images/sec. I used a server with 32 GB and 2x E5-2670 @ 2.6 GHz.

@borisgin
Author

MEMORY: The total memory overhead is relatively small. I ran imagenet training on my desktop with OMP_NUM_THREADS=4 and never saw memory utilization above 5 GB. Since the default desktop configuration these days is 8 GB, this overhead does not seem to be an issue. Even on servers, when I set OMP_NUM_THREADS=24, the size was still around 5-5.5 GB.
SPEED-UP: OpenMP is 3x faster than the best multi-threaded BLAS (MKL), for example for imagenet training. For a regular mid-range desktop the difference in speed between CPU and K20 is ~2.5x. For a two-year-old server the difference is even smaller: 1.3-1.4x.
Having fast CPU code is very convenient when someone wants to add their own algorithm or modify an existing one.

@jeffdonahue
Contributor

Your benchmarks are very impressive, thanks for reporting these numbers! I'll hopefully be able to give this a try at some point, and would definitely be in favor of merging this if I see speed/memory numbers anywhere near what you report (perhaps at first into a new branch off dev, to make it more easily available while doing further testing and refinement if necessary). I'm pretty busy over the next few weeks though; maybe somebody else will beat me to it.

@borisgin
Author

These are the setup details:

  • Desktop: CPU i7-4770 (Haswell), 3.5 GHz, DRAM 16 GB; GPU K20.
  • Ubuntu 12.04; gcc 4.7.3; MKL 11.1.

Test: imagenet, 100 train iterations (batch = 256).

  • GPU: time = 260 sec / memory = 0.8 GB
  • CPU: time = 752 sec / memory = 3.5 GB
    (Memory data is from the system monitor.)

@jeffdonahue
Contributor

Hi @borisgin, sorry it's taken me forever to get around to this.

Have you tested at all with OpenBLAS, or only MKL? I tried this out with OpenBLAS, and didn't see performance improve on a machine that reports 32 processors in /proc/cpuinfo (but see the correction below).

Here's what I did: initially I was using the system OpenBLAS library, but when I ran the training I saw a bunch of error messages telling me to recompile OpenBLAS with the option USE_OPENMP=1 defined. So I did that, linked against my new OpenBLAS build, and the error messages went away.

Using your branch, these are the results I see for imagenet training (./examples/imagenet/train_imagenet.sh, adding display: 1 and solver_mode: CPU to the bottom of imagenet_solver.prototxt):

I0716 16:53:59.961927 27214 solver.cpp:86] Solving CaffeNet
I0716 16:54:27.538390 27214 solver.cpp:272] Iteration 1, lr = 0.01
I0716 16:54:28.103991 27214 solver.cpp:112] Iteration 1, loss = 7.44474
I0716 16:54:52.166633 27214 solver.cpp:272] Iteration 2, lr = 0.01
I0716 16:54:52.470783 27214 solver.cpp:112] Iteration 2, loss = 7.34914
I0716 16:55:16.940089 27214 solver.cpp:272] Iteration 3, lr = 0.01
I0716 16:55:17.182062 27214 solver.cpp:112] Iteration 3, loss = 7.27532
I0716 16:55:42.486449 27214 solver.cpp:272] Iteration 4, lr = 0.01
I0716 16:55:42.766113 27214 solver.cpp:112] Iteration 4, loss = 7.35052
I0716 16:56:07.605473 27214 solver.cpp:272] Iteration 5, lr = 0.01
I0716 16:56:07.901649 27214 solver.cpp:112] Iteration 5, loss = 7.34963
I0716 16:56:33.244210 27214 solver.cpp:272] Iteration 6, lr = 0.01
I0716 16:56:33.531226 27214 solver.cpp:112] Iteration 6, loss = 7.49167
I0716 16:57:00.126989 27214 solver.cpp:272] Iteration 7, lr = 0.01
I0716 16:57:00.405313 27214 solver.cpp:112] Iteration 7, loss = 7.42552
I0716 16:57:26.207607 27214 solver.cpp:272] Iteration 8, lr = 0.01
I0716 16:57:26.506078 27214 solver.cpp:112] Iteration 8, loss = 7.49404
I0716 16:57:53.265570 27214 solver.cpp:272] Iteration 9, lr = 0.01
I0716 16:57:53.539079 27214 solver.cpp:112] Iteration 9, loss = 7.34146
I0716 16:58:21.023488 27214 solver.cpp:272] Iteration 10, lr = 0.01
I0716 16:58:21.394134 27214 solver.cpp:112] Iteration 10, loss = 7.4269
I0716 16:58:48.360399 27214 solver.cpp:272] Iteration 11, lr = 0.01
I0716 16:58:48.687850 27214 solver.cpp:112] Iteration 11, loss = 7.36875

Around 26 seconds per iteration. Then, I reran with export OMP_NUM_THREADS=8:

I0716 16:59:13.466909 27348 solver.cpp:86] Solving CaffeNet
I0716 16:59:43.876104 27348 solver.cpp:272] Iteration 1, lr = 0.01
I0716 16:59:44.445726 27348 solver.cpp:112] Iteration 1, loss = 7.44474
I0716 17:00:11.324043 27348 solver.cpp:272] Iteration 2, lr = 0.01
I0716 17:00:11.571719 27348 solver.cpp:112] Iteration 2, loss = 7.34914
I0716 17:00:38.797961 27348 solver.cpp:272] Iteration 3, lr = 0.01
I0716 17:00:39.049752 27348 solver.cpp:112] Iteration 3, loss = 7.27533
I0716 17:01:06.600169 27348 solver.cpp:272] Iteration 4, lr = 0.01
I0716 17:01:06.855245 27348 solver.cpp:112] Iteration 4, loss = 7.35052
I0716 17:01:34.575659 27348 solver.cpp:272] Iteration 5, lr = 0.01
I0716 17:01:34.832867 27348 solver.cpp:112] Iteration 5, loss = 7.34962
I0716 17:02:02.975780 27348 solver.cpp:272] Iteration 6, lr = 0.01
I0716 17:02:03.227632 27348 solver.cpp:112] Iteration 6, loss = 7.4917
I0716 17:02:32.491924 27348 solver.cpp:272] Iteration 7, lr = 0.01
I0716 17:02:32.749202 27348 solver.cpp:112] Iteration 7, loss = 7.42552
I0716 17:03:01.581933 27348 solver.cpp:272] Iteration 8, lr = 0.01
I0716 17:03:01.837805 27348 solver.cpp:112] Iteration 8, loss = 7.49396
I0716 17:03:31.338582 27348 solver.cpp:272] Iteration 9, lr = 0.01
I0716 17:03:31.592015 27348 solver.cpp:112] Iteration 9, loss = 7.34149
I0716 17:04:01.258724 27348 solver.cpp:272] Iteration 10, lr = 0.01
I0716 17:04:01.507393 27348 solver.cpp:112] Iteration 10, loss = 7.42689
I0716 17:04:31.262023 27348 solver.cpp:272] Iteration 11, lr = 0.01
I0716 17:04:31.515398 27348 solver.cpp:112] Iteration 11, loss = 7.36892

That's ~29 seconds per iteration.

I'm using an 8 core machine.

Do you think I could've done something wrong with the setup, and/or do you expect this will only work well with MKL? I can retry it with MKL if you think that's the problem. (I notice that when I open "top" running with just 1 OMP thread, I see CPU utilization often >1000%, so it seems like the blas library is doing quite a bit of parallelization already as @Yangqing suggested.)

@luotao1

luotao1 commented Jul 17, 2014

I tested borisgin's improvement; it indeed speeds things up on the CPU, but it does not scale as well as he said. Our result is 3x+ for 16 cores. @jeffdonahue, is your 8-core machine all physical cores?

@jeffdonahue
Contributor

Ah, nice catch -- there are actually only 2 physical cores on the machine I was using. I had looked at /proc/cpuinfo and misinterpreted the result (even knowing that it is easily misinterpretable)...it prints 32 "processors" with cpu cores: "8", but in fact the "physical id" is only 0 or 1 so I guess I have 2 physical cores. I will try again on a more powerful machine, sorry.

@OpenHero

@jeffdonahue it means you have 2 physical CPUs, each CPU has 8 cores, and each core has 2 hyper-threads, so you get 32 threads (processors).

  1. Get the CPU:
grep 'physical id' /proc/cpuinfo | sort -u
  2. Get the cores:
grep 'core id' /proc/cpuinfo | sort -u | wc -l
  3. Get the threads:
grep 'processor' /proc/cpuinfo | sort -u | wc -l
  4. Get the CPU version:
dmidecode -s processor-version

Check that OpenBLAS was compiled with OpenMP, or set the thread count with "void openblas_set_num_threads(int num_threads);".

By default, with USE_OPENMP=1 it will use all the threads, as said by @xianyi.
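As an example, a minimal sketch (not from this PR) of pinning the OpenBLAS thread count from code, using the openblas_set_num_threads call quoted above, so that BLAS-internal threading and any explicit OpenMP loops share the same core budget:

#include <omp.h>

// Exported by OpenBLAS (declared in its cblas.h) when built with threading.
extern "C" void openblas_set_num_threads(int num_threads);

int main() {
  // Give OpenBLAS the same thread budget as OMP_NUM_THREADS so the two
  // levels of parallelism do not oversubscribe the cores.
  openblas_set_num_threads(omp_get_max_threads());
  return 0;
}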

@luotao1

luotao1 commented Jul 17, 2014

@jeffdonahue your machine has 2 CPUs, and each CPU has 8 cores, so the number of physical cores is 16 and the number of processors (hyper-threads) is 32. Thus, you can set OMP_NUM_THREADS=16. But what is your batch size? When using OpenMP, the overhead of creating threads is not trivial, so you cannot use a very small batch size.

@borisgin
Author

Hi @jeffdonahue, I did not test the OpenMP + OpenBLAS combination. I tested with SMT (hyper-threading) disabled in the BIOS, since it does not help matrix multiplication at all. Did you rebuild OpenBLAS for your CPU? I have a server with 2 sockets x E5-2680; each CPU has 8 cores / 8 threads, 2.7 GHz. I ran imagenet training with batch size = 256 for 100 iterations:
K20: 305 sec
CPU with BLAS = atlas: 18517 sec
CPU with BLAS = open: 1916 sec
CPU with BLAS = mkl: 1443 sec
CPU with BLAS = mkl and with OpenMP: 419 sec

@xianyi

xianyi commented Jul 17, 2014

Hi @borisgin,
Thank you for the test. What's the OpenBLAS version?

@borisgin
Author

First I used what I got with "sudo apt-get install libopenblas-base",
but then I downloaded OpenBLAS xianyi-OpenBLAS-347dded and rebuilt it, since I wanted to retest on a Haswell CPU with AVX2.

@xianyi

xianyi commented Jul 17, 2014

Hi @borisgin,

Intel Xeon E5-2680 is Sandy Bridge arch, not Haswell.

For the Sandy Bridge arch, OpenBLAS 0.2.9 rolled back the sgemm kernel to the old Core2 kernel. In the latest 0.2.10 version, we enabled an optimized sgemm kernel for Sandy Bridge.

Could you retest the performance with the new OpenBLAS version?

@borisgin
Author

I got a new desktop with a Haswell CPU, so I wanted to rebuild the code to check how the new AVX2 instructions (FMA) impact performance. I will retest the performance of OpenBLAS vs (OpenBLAS + OpenMP) on it.

@bhack
Contributor

bhack commented Jul 17, 2014

@borisgin we need to take into account that on OS X with clang we will not have OpenMP available off the shelf.
Please take a look here: #689

@borisgin
Author

Hi @xianyi,
I retested BLAS = open with OpenMP.
System configuration: desktop with an i7-4960X CPU.
OpenBLAS = 0.2.9, built with gcc 4.6.4 via "make USE_OPENMP=1".
Test: imagenet_train, batch = 256; the times below are for 10 train iterations (+ a small overhead at the start and at the end):
BLAS = open: 169 sec
BLAS = open + OpenMP: 124 sec
BLAS = mkl: 97 sec
BLAS = mkl + OpenMP: 62 sec

@borisgin
Author

Rebased openmp onto the latest dev branch:

  1. added OpenMP to the lrn layer
  2. fixed the performance bug related to the MKL-alternative implementation of the powx(...) function

@borisgin
Author

I found a curious performance bug, related to the MKL-alternative implementation of the functions that are lacking in OpenBLAS.
PROBLEM: I noticed that when you run the speed benchmark with OpenBLAS, the normalization layer takes a very long time: 1193 msec. When I replaced OpenBLAS with MKL, the normalization layer took only 50 msec.
When I profiled it, it turned out that most of the time in lrn_layer.cpp is spent in the following function:
caffe_powx(scale_.count(), scale_data, -beta_, top_data);
which is the reimplementation of the MKL function in mkl_alternate.hpp:
DEFINE_VSL_UNARY_FUNC_WITH_PARAM(Powx, y[i] = pow(a[i], b));

I did a quick fix in lrn_layer.cpp:
#ifdef USE_MKL
caffe_powx(scale_.count(), scale_data, -beta_, top_data);
caffe_mul(scale_.count(), top_data, bottom_data, top_data);
#else
#ifdef OPENMP
#pragma omp parallel for
#endif
for (int i = 0; i < scale_.count(); i++) {
  // top_data[i] = pow(scale_data[i], -beta_) * bottom_data[i];
  top_data[i] = exp(log(scale_data[i]) * (-beta_)) * bottom_data[i];
}
#endif
But this version is still slower than MKL, which uses a vectorized pow.
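An alternative to special-casing lrn_layer.cpp would be to parallelize the fallback loop where it is generated in mkl_alternate.hpp. A hedged sketch of what that could look like for the powx substitute (mirroring what the DEFINE_VSL_UNARY_FUNC_WITH_PARAM macro expands to, not the actual Caffe code); MKL's vsPowx/vdPowx would still be faster, since they are vectorized on top of being threaded:

#include <cmath>

// Sketch of an OpenMP-parallelized scalar fallback for caffe_powx.
template <typename Dtype>
void powx_fallback(const int n, const Dtype* a, const Dtype b, Dtype* y) {
#ifdef _OPENMP
#pragma omp parallel for
#endif
  for (int i = 0; i < n; ++i) {
    y[i] = std::pow(a[i], b);
  }
}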

@xianyi, any plans to add fast powd to OpenBLAS?

Thanks, Boris

caffe with OPENBLAS
....
Average time per layer:
I1117 14:03:46.019644 17778 caffe.cpp:265] data forward: 16.3546 ms.
I1117 14:03:46.019652 17778 caffe.cpp:268] data backward: 0.00066 ms.
I1117 14:03:46.019661 17778 caffe.cpp:265] conv1 forward: 837.748 ms.
I1117 14:03:46.019670 17778 caffe.cpp:268] conv1 backward: 656.368 ms.
I1117 14:03:46.019681 17778 caffe.cpp:265] relu1 forward: 59.143 ms.
I1117 14:03:46.019690 17778 caffe.cpp:268] relu1 backward: 88.6836 ms.
I1117 14:03:46.019701 17778 caffe.cpp:265] pool1 forward: 477.523 ms.
I1117 14:03:46.019711 17778 caffe.cpp:268] pool1 backward: 111.665 ms.
I1117 14:03:46.019722 17778 caffe.cpp:265] norm1 forward: 1193.3 ms.
I1117 14:03:46.019732 17778 caffe.cpp:268] norm1 backward: 1229.55 ms.
I1117 14:03:46.019742 17778 caffe.cpp:265] conv2 forward: 1174.91 ms.
I1117 14:03:46.019753 17778 caffe.cpp:268] conv2 backward: 2174.22 ms.
I1117 14:03:46.019763 17778 caffe.cpp:265] relu2 forward: 38.0219 ms.
I1117 14:03:46.019774 17778 caffe.cpp:268] relu2 backward: 57.023 ms.
I1117 14:03:46.019783 17778 caffe.cpp:265] pool2 forward: 335.518 ms.
I1117 14:03:46.019794 17778 caffe.cpp:268] pool2 backward: 70.0151 ms.
I1117 14:03:46.019804 17778 caffe.cpp:265] norm2 forward: 667.69 ms.

I1117 14:03:46.019815 17778 caffe.cpp:268] norm2 backward: 692.64 ms.

caffe with MKL:
I1119 10:58:23.669105 29606 caffe.cpp:262] Average time per layer:
I1119 10:58:23.669136 29606 caffe.cpp:265] data forward: 16.7308 ms.
I1119 10:58:23.669189 29606 caffe.cpp:268] data backward: 0.00088 ms.
I1119 10:58:23.669239 29606 caffe.cpp:265] conv1 forward: 745.611 ms.
I1119 10:58:23.669287 29606 caffe.cpp:268] conv1 backward: 640.988 ms.
I1119 10:58:23.669335 29606 caffe.cpp:265] relu1 forward: 62.2671 ms.
I1119 10:58:23.669383 29606 caffe.cpp:268] relu1 backward: 82.4027 ms.
I1119 10:58:23.669430 29606 caffe.cpp:265] pool1 forward: 490.123 ms.
I1119 10:58:23.669476 29606 caffe.cpp:268] pool1 backward: 127.349 ms.
I1119 10:58:23.669522 29606 caffe.cpp:265] norm1 forward: 51.0725 ms.
I1119 10:58:23.669569 29606 caffe.cpp:268] norm1 backward: 63.3013 ms.
I1119 10:58:23.669620 29606 caffe.cpp:265] conv2 forward: 1086.59 ms.
I1119 10:58:23.669667 29606 caffe.cpp:268] conv2 backward: 2024.77 ms.
I1119 10:58:23.669713 29606 caffe.cpp:265] relu2 forward: 39.8503 ms.
I1119 10:58:23.673890 29606 caffe.cpp:268] relu2 backward: 51.6888 ms.
I1119 10:58:23.673945 29606 caffe.cpp:265] pool2 forward: 330.297 ms.
I1119 10:58:23.673995 29606 caffe.cpp:268] pool2 backward: 79.1423 ms.
I1119 10:58:23.674042 29606 caffe.cpp:265] norm2 forward: 33.7946 ms.

I1119 10:58:23.674092 29606 caffe.cpp:268] norm2 backward: 47.9396 ms.

caffe with MKL and OpenMP
I1119 10:42:58.466241 28810 caffe.cpp:262] Average time per layer:
I1119 10:42:58.466248 28810 caffe.cpp:265] data forward: 17.1041 ms.
I1119 10:42:58.466264 28810 caffe.cpp:268] data backward: 0.00084 ms.
I1119 10:42:58.466272 28810 caffe.cpp:265] conv1 forward: 502.285 ms.
I1119 10:42:58.466284 28810 caffe.cpp:268] conv1 backward: 397.606 ms.
I1119 10:42:58.466295 28810 caffe.cpp:265] relu1 forward: 30.1847 ms.
I1119 10:42:58.466306 28810 caffe.cpp:268] relu1 backward: 44.3305 ms.
I1119 10:42:58.466316 28810 caffe.cpp:265] pool1 forward: 146.703 ms.
I1119 10:42:58.466327 28810 caffe.cpp:268] pool1 backward: 51.1912 ms.
I1119 10:42:58.466337 28810 caffe.cpp:265] norm1 forward: 42.2056 ms.
I1119 10:42:58.466347 28810 caffe.cpp:268] norm1 backward: 44.3377 ms.
I1119 10:42:58.466359 28810 caffe.cpp:265] conv2 forward: 719.682 ms.
I1119 10:42:58.466370 28810 caffe.cpp:268] conv2 backward: 1408.03 ms.
I1119 10:42:58.466383 28810 caffe.cpp:265] relu2 forward: 19.3619 ms.
I1119 10:42:58.466393 28810 caffe.cpp:268] relu2 backward: 29.7551 ms.
I1119 10:42:58.466404 28810 caffe.cpp:265] pool2 forward: 100.753 ms.
I1119 10:42:58.466415 28810 caffe.cpp:268] pool2 backward: 33.027 ms.
I1119 10:42:58.466426 28810 caffe.cpp:265] norm2 forward: 25.7057 ms.
I1119 10:42:58.466438 28810 caffe.cpp:268] norm2 backward: 29.2872 ms.

@xianyi

xianyi commented Nov 24, 2014

@borisgin, I already added a feature request for this function.

@forresti
Contributor

Is this PR (or something similar) going to be merged soon?

When I checked not too long ago, CPU Caffe was unnecessarily slow without OpenMP. I'm debating tacking my own OpenMP changes together... but merging this PR with present-day Caffe would be optimal.

@talda

talda commented Jun 24, 2015

I wouldn't hold my breath. This PR is more than a year old, and it doesn't seem the maintainers think it should be merged (it would probably be better to close it).

@shelhamer
Member

While CPU execution can be further optimized, this PR is closed since it is against the deprecated dev branch. This branch was not merged at the time due to concerns about further complexity and dependencies. Thanks for your work @borisgin.

@shelhamer shelhamer closed this Aug 26, 2015
@bhack
Contributor

bhack commented Aug 26, 2015

@naibaf7 It also works on the CPU at #2610.

@naibaf7
Member

naibaf7 commented Aug 26, 2015

@shelhamer @bhack
#2610 uses OpenCL kernels and a CPU BLAS (MKL, OpenBLAS, Atlas).
While OpenCL might not be the best for CPUs, AlexNet runs twice as fast on Hybrid-OpenCL (that's what I call it) as on the "legacy" single-threaded CPU backend on a quad-core CPU.

@Crefeda

Crefeda commented Jan 4, 2016

Hi, I was trying to build your OpenMP version on a CentOS 6.5 machine and I got a protobuf version error.
My current version is libprotoc 2.4.1.
Could you let me know which version was compatible for you?

@borisgin
Author

borisgin commented Jan 5, 2016

Hi Crefeda,
Can you post the detailed error, please?
Did the caffe-master branch build pass without issues?
Thanks, Boris

@Yangqing
Member

Yangqing commented Jan 5, 2016

@Crefeda If I recall correctly, caffe relies on some features that are introduced in protobuf 2.5, so I think 2.5 and above should work.
